L2 - Statistics
This lecture is heavily inspired by that of Louis Sirugue.
2026-01-20
1. Summarizing data
1.1 Distributions
1.2 Central tendency
1.3 Spread
1.4 Relationship between variables
The point of descriptive statistics is to summarize a big table of values with a small set of tractable statistics
The most comprehensive way to characterize a variable/vector is to compute its distribution:
\(\rightarrow\) Consider for instance the following variable:
| 18 | 17 | 21 | 18 | 17 | 16 | 19 | 24 | 15 | 24 | 18 | 19 |
| 22 | 20 | 21 | 18 | 17 | 20 | 20 | 22 | 20 | 18 | 19 | 22 |
| Variable 1 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 24 |
| n | 1 | 1 | 3 | 5 | 3 | 4 | 2 | 3 | 2 |
| 17.53074 | 21.15684 | 16.66565 | 18.58788 | 15.34944 | 17.67870 |
| 22.40558 | 20.96827 | 16.97330 | 20.00811 | 19.92690 | 19.37334 |
| 16.73665 | 17.81842 | 16.13361 | 23.81101 | 23.74077 | 18.61935 |
| 20.37507 | 17.89235 | 20.02870 | 21.82973 | 18.04681 | 22.41368 |
\(\rightarrow\) Let’s have a look at the corresponding bar plot
\(\rightarrow\) In practice some variables are difficult to classify. E.g., age in years can be viewed as a discrete variable because it takes a finite set of values; but since this set can be quite wide, one could also view it as a continuous variable. It often depends on the context.
\(\rightarrow\) Consider for instance the following variable. For clarity each point is shifted vertically by a random amount
- We can divide the domain of this variable into 5 bins
- And count the number of observations within each bin
\(\rightarrow\) Note that choosing the number of bins is equivalent to choosing the width of each bin
How to summarize these distributions with simple statistics?
1.1 Distributions \(\checkmark\)
1.2 Central tendency
1.3 Spread
1.4 Relationship between variables
| 30 | 30 | 30 | 41 | 30 | 34 | 28 |
| 32 | 30 | 31 | 33 | 34 | 28 | 35 |
The mean is simply the sum of all the countries’ hours worked divided by the number of countries:
\[\bar{x} = \frac{1}{N}\sum_{i = 1}^Nx_i\]
\[\frac{30 + 30 + 30 + 41 + 30 + 34 + 28 + 32 + 30 + 31 + 33 + 34 + 28 + 35}{14} = 31.86\]
| 30 | 30 | 30 | 41 | 30 | 34 | 28 |
| 32 | 30 | 31 | 33 | 34 | 28 | 35 |
Note that the mean can also be expressed as the sum of each value weighted by its proportion in the distribution:
\[ \begin{multline} \bar{x} = \frac{2}{14} \times 28 + \frac{5}{14} \times 30 + \frac{1}{14} \times 31 + \frac{1}{14} \times 32 + \frac{1}{14} \times 33 + \frac{2}{14} \times 34 \\+ \frac{1}{14} \times 35 + \frac{1}{14} \times 41 = 31.86 \end{multline} \]
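Both expressions can be checked quickly in R (a sketch; the slide itself may use different code):

```r
# Weekly hours worked in the 14 countries above
x <- c(30, 30, 30, 41, 30, 34, 28, 32, 30, 31, 33, 34, 28, 35)
mean(x)  # 31.85714

# Equivalently, weight each distinct value by its share of observations
vals <- c(28, 30, 31, 32, 33, 34, 35, 41)
n    <- c( 2,  5,  1,  1,  1,  2,  1,  1)
weighted.mean(vals, n / sum(n))  # 31.85714
```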
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| Value | 28 | 28 | 30 | 30 | 30 | 30 | 30 | 31 | 32 | 33 | 34 | 34 | 35 | 41 |
As we have 14 observations, the median is the average of the 7th and the 8th observations:
\[\text{Med}(x) = \begin{cases} x[\frac{N+1}{2}] & \text{if } N \text{ is odd}\\ \frac{x[\frac{N}{2}]+x[\frac{N}{2}+1]}{2} & \text{if } N \text{ is even} \end{cases} = \frac{30 + 31}{2} = 30.5\]
The relative magnitude of the mean and the median depends on the symmetry of the distribution:
Right-skewed (mean > median): \(\rightarrow\) income, wealth
Symmetric (mean ≈ median): \(\rightarrow\) height, exam scores (ideally)
Left-skewed (mean < median): \(\rightarrow\) age at death, time spent on social media
The median is less sensitive than the mean to thick tails and outliers
For this reason we say that the median is a robust statistic
Let’s illustrate that with a small example!
Consider the following variable:
| -3 | -2 | -2 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 2 | 2 | 3 |
How would the mean and the median react if we were to add one single observation?
Both statistics have dedicated R functions
\(\rightarrow\) mean for the mean and median for the median:
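For instance, with the variable above (a sketch):

```r
x <- c(-3, -2, -2, -1, -1, -1, 0, 1, 1, 1, 2, 2, 3)
mean(x)    # 0
median(x)  # 0

# Add one single extreme observation:
x_out <- c(x, 100)
mean(x_out)    # 7.142857: the mean is pulled far to the right
median(x_out)  # 0.5: the median barely moves
```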
Consider the following binary variable of length 4
| 0 | 1 | 1 | 1 |
\[\frac{0 + 1 + 1 + 1}{4} = \frac{3}{4} = 75\%\]
\[\frac{1 + 1}{2} = 1\]
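In R (a minimal sketch):

```r
x <- c(0, 1, 1, 1)
mean(x)    # 0.75: the mean of a binary variable is the share of ones
median(x)  # 1
```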
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread
1.4 Relationship between variables
The most intuitive statistic to describe the spread of a variable is probably the range: the difference between the maximum and the minimum
But consider the following two distributions:
In the presence of outliers or very skewed distributions, the full range of a variable may not be representative of what we mean by ‘spread’
That’s why we tend to prefer inter-quantile ranges
| -3 | -2 | -1 | 0 | 1 | 2 | 3 |
\[Q_1 = -2,\:\:Q_2 = 0,\:\:Q_3 = 2\]
| -3 | -2 | -1 | 0 | 0 | 1 | 2 | 3 |
\[Q_1 = -1.5,\:\:Q_2 = 0,\:\:Q_3 = 1.5\]
The interquartile range is the difference between the third and the first quartile: \(\text{IQR} = Q_3 - Q_1\)
Put differently, it is the width of the interval that contains the middle half of the distribution
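Note that quantiles can be computed in several ways. R’s quantile() defaults to an interpolation method (type = 7), while the values above correspond to type = 2 (averaging at discontinuities); a sketch of the difference:

```r
x <- c(-3, -2, -1, 0, 1, 2, 3)
quantile(x, c(0.25, 0.5, 0.75), type = 2)  # -2  0  2  (matches the values above)
quantile(x, c(0.25, 0.5, 0.75))            # -1.5  0  1.5  (default, type = 7)
```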
Can we just take the average deviation from the mean?
| x | mean(x) | x - mean(x) |
|---|---|---|
| 1 | 2.5 | -1.5 |
| 4 | 2.5 | 1.5 |
| -3 | 2.5 | -5.5 |
| 8 | 2.5 | 5.5 |
By construction it would always be 0: values above and under the mean compensate
But we can use the absolute value of each deviation: \(|x_i-\bar{x}|\)
Or their square: \((x_i-\bar{x})^2\)
The variance is computed by averaging the squared deviations from the mean:
\[\text{Var}(x) = \frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2\]
Because the variance is an average of squared deviations, it is expressed in squared units and can be quite large compared to statistics like the mean, the median, or the interquartile range.
To express the spread in the same unit as the data, we can take the square root of the variance, which is called the standard deviation:
\[\text{SD}(x) = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{N}\sum_{i = 1}^N(x_i-\bar{x})^2}\]
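One caveat when computing these in R: the built-in var() and sd() divide by \(N-1\) (the sample variance), not by \(N\) as in the formulas above. A sketch:

```r
x <- c(1, 4, -3, 8)

# Population variance, matching the formula above (divides by N):
var_pop <- mean((x - mean(x))^2)
var_pop        # 16.25

# R's var() divides by N - 1 instead:
var(x)         # 21.66667

# The two are related by a factor (N - 1) / N:
var(x) * (length(x) - 1) / length(x)  # 16.25

sqrt(var_pop)  # 4.031129: population standard deviation
```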
Remember that the median is less sensitive than the mean to thick tails and outliers
This is also the case for the IQR relative to the standard deviation
Let’s go back to our previous example!
Consider the following variable:
| -3 | -2 | -2 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 2 | 2 | 3 |
These two distributions:
- have the same interquartile range
- have different standard deviations
Both statistics have dedicated R functions
\(\rightarrow\) sd for the standard deviation and IQR for the interquartile range
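For instance, with the variable from the earlier example (a sketch):

```r
x <- c(-3, -2, -2, -1, -1, -1, 0, 1, 1, 1, 2, 2, 3)
sd(x)   # 1.825742 (note: uses the N - 1 denominator)
IQR(x)  # 2
```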
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables
How can we characterize the relationship between two variables?
There are two main statistics:
The covariance
The correlation
\[\text{Cov}(x,y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})\]
\[\text{Corr}(x,y) = \frac{\text{Cov}(x,y)}{\sqrt{\text{Var}(x)}\sqrt{\text{Var}(y)}}\]
Both statistics have dedicated R functions
\(\rightarrow\) cov for covariance and cor for correlation:
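A minimal sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cov(x, y)  # 1.5 (like var(), R uses the N - 1 denominator)
cor(x, y)  # 0.7745967: always between -1 and 1
```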
Load data on global working hours in 2023 from Gethin and Saez (2025). You can find it here.
Compute the mean of hours_worked (hours per adult) and hours_worker (hours per worker). Why is there such a difference?
Compute the median of hours_worked and hours_worker. Is there a big difference with the mean? What does this suggest about the distribution of (average) working hours across countries?
Compute the three quartiles of the distribution of hours_worker. What’s the 75th percentile? Try to guess which country this is.
Compute the interquartile range of hours_worker_men and hours_worker_women. What do you notice?
Compute the correlation between hours_worked and tax_labor. (Hint: remember what NA stands for.)
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
tidyverse package suite
The tidyverse is a collection of packages that facilitate data manipulation and visualization.
We will use dplyr to manipulate data and ggplot2 to visualize data.
You don’t need to load each package individually, instead you can just:
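The code chunk from this slide was not captured in the extraction; presumably it is simply:

```r
library(tidyverse)  # loads dplyr, ggplot2, and the other core tidyverse packages
```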
dplyr verbs
dplyr is a grammar of data manipulation providing very user-friendly functions to handle the most common data manipulation tasks:
- mutate(): add/modify variables
- select(): keep/drop variables (columns)
- filter(): keep/drop observations (rows)
- arrange(): sort rows according to given variable(s)
- summarise(): aggregate data into statistics
- group_by(): compute the above verbs by groups

A very handy operator to use with the dplyr grammar is the pipe |>:
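A quick illustration of the pipe with base R (a sketch; it requires R ≥ 4.1, and any function works the same way):

```r
x <- c(3, 1, 2)
# x |> f() is just another way of writing f(x):
sort(x)      # 1 2 3
x |> sort()  # 1 2 3
```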
dplyr: an example
We will use data on global working hours, but this time since 1900. You can find the data here.
global_working_hours <- read.csv("https://www.dropbox.com/scl/fi/aq6nlnuun9o8bk86h89a2/hours_worked_panel_clean.csv?rlkey=f6jqgl8swkvf6kgl1z9eqczmr&dl=1")
str(global_working_hours)
'data.frame': 2565 obs. of 9 variables:
$ country : chr "United Arab Emirates" "United Arab Emirates" "United Arab Emirates" "United Arab Emirates" ...
$ region : chr "Middle East and North Africa" "Middle East and North Africa" "Middle East and North Africa" "Middle East and North Africa" ...
$ year : int 2017 2018 2019 2022 2023 2007 2008 2011 2012 2013 ...
$ hours_worked : num 44.1 40.9 40.8 37.9 37.1 ...
$ hours_worked_men : num 49 47.2 47.3 44.5 43.2 ...
$ hours_worked_women: num 32 26.4 28.5 25.8 25.3 ...
$ hours_worker : num 55.9 51.7 52.1 50.4 48.4 ...
$ hours_worker_men : num 53.6 51.4 52.1 50.5 48.3 ...
$ hours_worker_women: num 65.2 53.2 52.2 50.2 48.4 ...
dplyr: an example
# A tibble: 2,565 × 9
country region year hours_worked hours_worked_men hours_worked_women
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 United Arab Em… Middl… 2017 44.1 49.0 32.0
2 United Arab Em… Middl… 2018 40.9 47.2 26.4
3 United Arab Em… Middl… 2019 40.8 47.3 28.5
4 United Arab Em… Middl… 2022 37.9 44.5 25.8
5 United Arab Em… Middl… 2023 37.1 43.2 25.3
6 Afghanistan Middl… 2007 22.6 31.8 13.1
7 Afghanistan Middl… 2008 22.1 31.1 12.8
8 Afghanistan Middl… 2011 20.2 35.2 4.68
9 Afghanistan Middl… 2012 19.5 34.1 4.55
10 Afghanistan Middl… 2013 16.4 27.7 5.01
# ℹ 2,555 more rows
# ℹ 3 more variables: hours_worker <dbl>, hours_worker_men <dbl>,
# hours_worker_women <dbl>
dplyr: an example
# A tibble: 2,565 × 3
country year hours_worked
<chr> <int> <dbl>
1 United Arab Emirates 2017 44.1
2 United Arab Emirates 2018 40.9
3 United Arab Emirates 2019 40.8
4 United Arab Emirates 2022 37.9
5 United Arab Emirates 2023 37.1
6 Afghanistan 2007 22.6
7 Afghanistan 2008 22.1
8 Afghanistan 2011 20.2
9 Afghanistan 2012 19.5
10 Afghanistan 2013 16.4
# ℹ 2,555 more rows
dplyr: an example
# A tibble: 2,565 × 4
country year hours_worker hours_worked_greater_40
<chr> <int> <dbl> <lgl>
1 United Arab Emirates 2017 55.9 TRUE
2 United Arab Emirates 2018 51.7 TRUE
3 United Arab Emirates 2019 52.1 TRUE
4 United Arab Emirates 2022 50.4 TRUE
5 United Arab Emirates 2023 48.4 TRUE
6 Afghanistan 2007 36.5 FALSE
7 Afghanistan 2008 35.8 FALSE
8 Afghanistan 2011 42.9 TRUE
9 Afghanistan 2012 42.3 TRUE
10 Afghanistan 2013 37.4 FALSE
# ℹ 2,555 more rows
dplyr: an example
# A tibble: 55 × 4
country year hours_worker hours_worked_greater_40
<chr> <int> <dbl> <lgl>
1 France 1968 44.8 TRUE
2 France 1969 44.2 TRUE
3 France 1970 43.2 TRUE
4 France 1971 43.0 TRUE
5 France 1972 42.9 TRUE
6 France 1973 42.6 TRUE
7 France 1974 42.3 TRUE
8 France 1975 40.4 TRUE
9 France 1976 40.7 TRUE
10 France 1977 40.5 TRUE
# ℹ 45 more rows
dplyr: an example
# A tibble: 1 × 2
mean_hours num_greater_40
<dbl> <int>
1 35.8 11
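The code chunks behind these example slides were not captured; a chain like the following (shown here on a small hypothetical data frame, since the exact slide code is an assumption) combines the verbs above:

```r
library(dplyr)

# Hypothetical stand-in for the global_working_hours data:
df <- data.frame(country      = c("France", "France", "Spain"),
                 year         = c(1968, 1975, 1975),
                 hours_worker = c(44.8, 40.4, 39.0))

df |>
  filter(country == "France") |>
  mutate(greater_40 = hours_worker > 40) |>
  summarise(mean_hours     = mean(hours_worker),
            num_greater_40 = sum(greater_40))
# mean_hours = 42.6, num_greater_40 = 2
```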
mutate() functions
ifelse()
global_working_hours |>
  select(country, year) |>
  mutate(post_2000 = ifelse(year >= 2000,
                            "21st century",
                            "20th century")) |>
  head()
# A tibble: 6 × 3
country year post_2000
<chr> <int> <chr>
1 United Arab Emirates 2017 21st century
2 United Arab Emirates 2018 21st century
3 United Arab Emirates 2019 21st century
4 United Arab Emirates 2022 21st century
5 United Arab Emirates 2023 21st century
6 Afghanistan 2007 21st century
case_when()
global_working_hours |>
select(country, year) |>
mutate(year_gp = case_when(year <= 1950 ~ "1900-1950",
year %in% 1951:1999 ~ "1951-1999",
year >= 2000 ~ "2000-2023")) |>
  head()
# A tibble: 6 × 3
country year year_gp
<chr> <int> <chr>
1 United Arab Emirates 2017 2000-2023
2 United Arab Emirates 2018 2000-2023
3 United Arab Emirates 2019 2000-2023
4 United Arab Emirates 2022 2000-2023
5 United Arab Emirates 2023 2000-2023
6 Afghanistan 2007 2000-2023
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
2.1 dplyr verbs \(\checkmark\)
group_by() and mutate()
With group_by() you can perform computations separately for the different categories of a variable
global_working_hours |>
select(year, country, region, hours_worker) |>
mutate(mean_all = mean(hours_worker)) |>
arrange(country, year) |>
  head(10)
# A tibble: 10 × 5
year country region hours_worker mean_all
<int> <chr> <chr> <dbl> <dbl>
1 2007 Afghanistan Middle East and North Africa 36.5 39.5
2 2008 Afghanistan Middle East and North Africa 35.8 39.5
3 2011 Afghanistan Middle East and North Africa 42.9 39.5
4 2012 Afghanistan Middle East and North Africa 42.3 39.5
5 2013 Afghanistan Middle East and North Africa 37.4 39.5
6 2014 Afghanistan Middle East and North Africa 37.7 39.5
7 2017 Afghanistan Middle East and North Africa 38.0 39.5
8 2020 Afghanistan Middle East and North Africa 42.0 39.5
9 2021 Afghanistan Middle East and North Africa 38.7 39.5
10 2002 Albania Eastern Europe and ex-USSR 37.8 39.5
global_working_hours |>
select(year, country, region, hours_worker) |>
group_by(region) |>
mutate(mean_reg_yr = mean(hours_worker)) |>
arrange(country, year) |>
  head(10)
# A tibble: 10 × 5
# Groups: region [2]
year country region hours_worker mean_reg_yr
<int> <chr> <chr> <dbl> <dbl>
1 2007 Afghanistan Middle East and North Africa 36.5 44.1
2 2008 Afghanistan Middle East and North Africa 35.8 44.1
3 2011 Afghanistan Middle East and North Africa 42.9 44.1
4 2012 Afghanistan Middle East and North Africa 42.3 44.1
5 2013 Afghanistan Middle East and North Africa 37.4 44.1
6 2014 Afghanistan Middle East and North Africa 37.7 44.1
7 2017 Afghanistan Middle East and North Africa 38.0 44.1
8 2020 Afghanistan Middle East and North Africa 42.0 44.1
9 2021 Afghanistan Middle East and North Africa 38.7 44.1
10 2002 Albania Eastern Europe and ex-USSR 37.8 38.2
group_by() and mutate()
global_working_hours |>
select(year, country, region, hours_worker) |>
mutate(mean_all = mean(hours_worker)) |>
arrange(country, year) |>
  count(region, mean_all)
# A tibble: 8 × 3
region mean_all n
<chr> <dbl> <int>
1 East and Southeast Asia 39.5 246
2 Eastern Europe and ex-USSR 39.5 503
3 Latin America 39.5 515
4 Middle East and North Africa 39.5 167
5 South Asia 39.5 64
6 Sub-Saharan Africa 39.5 159
7 United States 39.5 123
8 Western Europe and Anglosphere 39.5 788
global_working_hours |>
select(year, country, region, hours_worker) |>
group_by(region) |>
mutate(mean_reg_yr = mean(hours_worker)) |>
arrange(country, year) |>
  count(region, mean_reg_yr)
# A tibble: 8 × 3
# Groups: region [8]
region mean_reg_yr n
<chr> <dbl> <int>
1 East and Southeast Asia 43.3 246
2 Eastern Europe and ex-USSR 38.2 503
3 Latin America 42.5 515
4 Middle East and North Africa 44.1 167
5 South Asia 45.6 64
6 Sub-Saharan Africa 41.3 159
7 United States 42.8 123
8 Western Europe and Anglosphere 34.9 788
count() is a very useful dplyr function for (cross-)tabulation
group_by() and summarise()
# A tibble: 8 × 2
region mean_hours
<chr> <dbl>
1 East and Southeast Asia 43.3
2 Eastern Europe and ex-USSR 38.2
3 Latin America 42.5
4 Middle East and North Africa 44.1
5 South Asia 45.6
6 Sub-Saharan Africa 41.3
7 United States 42.8
8 Western Europe and Anglosphere 34.9
mutate() \(\neq\) summarise()
mutate() takes an operation that converts N values into N values: one output per row
summarise() takes an operation that converts N values into a single value: one output per group
group_by() using .by
dplyr recently introduced the .by argument:
# A tibble: 8 × 2
region mean_hours
<chr> <dbl>
1 Middle East and North Africa 44.1
2 Eastern Europe and ex-USSR 38.2
3 Sub-Saharan Africa 41.3
4 Latin America 42.5
5 Western Europe and Anglosphere 34.9
6 South Asia 45.6
7 East and Southeast Asia 43.3
8 United States 42.8
Clean alternative to group_by
Works with all (relevant) dplyr verbs and with multiple variables c(var1, var2)
No need to ungroup()
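A minimal sketch (with hypothetical data) of the same aggregation written with .by:

```r
library(dplyr)

df <- data.frame(region = c("A", "A", "B"),
                 hours  = c(40, 44, 35))

# No group_by()/ungroup() needed:
df |> summarise(mean_hours = mean(hours), .by = region)
# region A: 42, region B: 35
```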
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
2.1 dplyr verbs \(\checkmark\)
2.2 group_by \(\checkmark\)
_join()
The last dplyr verbs we will cover enable merging datasets together, a very common operation.
They come in different varieties:
- full_join(): keep all rows from both datasets
- inner_join(): keep only the rows that match in both datasets
- left_join(): keep all rows from the first (left) dataset, matched where possible
- right_join(): keep all rows from the second (right) dataset, matched where possible
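A minimal sketch with two hypothetical tables showing how the varieties differ:

```r
library(dplyr)

a <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

nrow(left_join(a, b, by = "id"))   # 3: all rows of a, NA where b has no match
nrow(inner_join(a, b, by = "id"))  # 2: only ids present in both
nrow(full_join(a, b, by = "id"))   # 4: all ids from either table
```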
_join()
Imagine you have 2023 working hours in one dataset and GDP in another:
working_hours <- read.csv("https://www.dropbox.com/scl/fi/dl9dpac6pgl9tje02buzw/hours_worked_panel_clean_only_hours.csv?rlkey=dbfce160u6d653ohhiy9zijb9&dl=1")
gdp <- read.csv("https://www.dropbox.com/scl/fi/tqx6qlb3zqna4hmtevsvw/gdp.csv?rlkey=hkgxwgjurd9n54r264x2dxquo&dl=1")
working_hours |>
  head(10)
                country year hours_worked
1 United Arab Emirates 2017 44.07154
2 United Arab Emirates 2018 40.94445
3 United Arab Emirates 2019 40.82006
4 United Arab Emirates 2022 37.88909
5 United Arab Emirates 2023 37.12018
6 Afghanistan 2007 22.55852
7 Afghanistan 2008 22.08803
8 Afghanistan 2011 20.15557
9 Afghanistan 2012 19.52261
10 Afghanistan 2013 16.40564
_join()
Imagine you have 2023 working hours in one dataset and GDP in another:
country year gdp
1 United Arab Emirates 2017 704732266496
2 United Arab Emirates 2018 713991258112
3 United Arab Emirates 2019 721905123328
4 United Arab Emirates 2022 772207411200
5 United Arab Emirates 2023 798492196864
6 Afghanistan 2007 71506411520
7 Afghanistan 2008 70340698112
8 Afghanistan 2011 87721918464
9 Afghanistan 2012 96194863104
10 Afghanistan 2013 102418915328
_join()
Imagine you have 2023 working hours in one dataset and GDP in another:
working_hours <- read.csv("https://www.dropbox.com/scl/fi/dl9dpac6pgl9tje02buzw/hours_worked_panel_clean_only_hours.csv?rlkey=dbfce160u6d653ohhiy9zijb9&dl=1")
gdp <- read.csv("https://www.dropbox.com/scl/fi/tqx6qlb3zqna4hmtevsvw/gdp.csv?rlkey=hkgxwgjurd9n54r264x2dxquo&dl=1")
working_hours |>
left_join(gdp) |>
  head(10)
                country year hours_worked gdp
1 United Arab Emirates 2017 44.07154 704732266496
2 United Arab Emirates 2018 40.94445 713991258112
3 United Arab Emirates 2019 40.82006 721905123328
4 United Arab Emirates 2022 37.88909 772207411200
5 United Arab Emirates 2023 37.12018 798492196864
6 Afghanistan 2007 22.55852 71506411520
7 Afghanistan 2008 22.08803 70340698112
8 Afghanistan 2011 20.15557 87721918464
9 Afghanistan 2012 19.52261 96194863104
10 Afghanistan 2013 16.40564 102418915328
tidylog package
dplyr is extremely powerful and convenient but it has one drawback: it provides no information when performing operations
tidylog is the solution to this problem!
Joining with `by = join_by(country, year)`
left_join: added one column (gdp)
> rows only in working_hours 0
> rows only in gdp ( 0)
> matched rows 2,565
> =======
> rows total 2,565
country year hours_worked gdp
1 United Arab Emirates 2017 44.07154 704732266496
2 United Arab Emirates 2018 40.94445 713991258112
3 United Arab Emirates 2019 40.82006 721905123328
4 United Arab Emirates 2022 37.88909 772207411200
5 United Arab Emirates 2023 37.12018 798492196864
6 Afghanistan 2007 22.55852 71506411520
7 Afghanistan 2008 22.08803 70340698112
8 Afghanistan 2011 20.15557 87721918464
9 Afghanistan 2012 19.52261 96194863104
10 Afghanistan 2013 16.40564 102418915328
Load data on global working hours since 1900 from Gethin and Saez (2025). You can find it here.
Tabulate the number of observations (countries) per year.
Which countries had average hours per worker (hours_worker) greater than 48 in 2023?
In which years did at least one Latin American country have average worker hours between 30 and 32 hours? What happened that year?
Compute the max of hours_worker for countries in the Western Europe and Anglosphere and United States regions, separately, in 2023. (Hint: the %in% operator will come in handy.)
Compute the average difference in hours worked per adult (hours_worked) between men and women in 1990, 2000, 2010 and 2020. What do you observe?
When things do not work the way you want, NAs are the usual suspects
For instance, this is how the mean function reacts to NAs:
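For instance (a sketch):

```r
x <- c(1, 2, NA)
mean(x)                # NA: one missing value contaminates the whole result
mean(x, na.rm = TRUE)  # 1.5
```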
You should systematically check for NAs!
Don’t pipe blindfolded!
Oftentimes things don’t work either because:
This is precisely what help files are made for
Search on the internet/use AI!
Sometimes R breaks and returns an error (usually kind of cryptic)
Error: '\U' used without hex digits in character string (<input>:1:14)
What to do:
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
2. Manipulating data \(\checkmark\)
2.1 dplyr verbs \(\checkmark\)
2.2 group_by \(\checkmark\)
2.3 Joining data \(\checkmark\)
ggplot2 package
ggplot2 is a grammar of graphics providing a very powerful and user-friendly visualization tool.
We used it in lecture 1, without going through the different elements
Let’s use the global working hours since 1900 data.
# A tibble: 2,565 × 9
country region year hours_worked hours_worked_men hours_worked_women
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 United Arab Em… Middl… 2017 44.1 49.0 32.0
2 United Arab Em… Middl… 2018 40.9 47.2 26.4
3 United Arab Em… Middl… 2019 40.8 47.3 28.5
4 United Arab Em… Middl… 2022 37.9 44.5 25.8
5 United Arab Em… Middl… 2023 37.1 43.2 25.3
6 Afghanistan Middl… 2007 22.6 31.8 13.1
7 Afghanistan Middl… 2008 22.1 31.1 12.8
8 Afghanistan Middl… 2011 20.2 35.2 4.68
9 Afghanistan Middl… 2012 19.5 34.1 4.55
10 Afghanistan Middl… 2013 16.4 27.7 5.01
# ℹ 2,555 more rows
# ℹ 3 more variables: hours_worker <dbl>, hours_worker_men <dbl>,
# hours_worker_women <dbl>
# A tibble: 55 × 9
country region year hours_worked hours_worked_men hours_worked_women
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 France Western Europ… 1968 24.5 34.5 15.3
2 France Western Europ… 1969 24.0 33.6 15.1
3 France Western Europ… 1970 23.5 32.9 14.8
4 France Western Europ… 1971 23.2 32.5 14.5
5 France Western Europ… 1972 23.3 32.4 14.8
6 France Western Europ… 1973 23.1 32.1 14.9
7 France Western Europ… 1974 23.0 31.7 15.0
8 France Western Europ… 1975 21.8 29.8 14.4
9 France Western Europ… 1976 21.7 29.6 14.5
10 France Western Europ… 1977 21.7 29.3 14.7
# ℹ 45 more rows
# ℹ 3 more variables: hours_worker <dbl>, hours_worker_men <dbl>,
# hours_worker_women <dbl>
scale + _ + <aes> + _ + <type> + ()
What parameter do you want to adjust? \(\rightarrow\) <aes>
What type is the parameter? \(\rightarrow\) <type>
I want to change my discrete x-axis \(\rightarrow\) scale_x_discrete()
I want to change the range of point sizes mapped from a continuous variable \(\rightarrow\) scale_size_continuous()
I want to rescale y-axis as log10 \(\rightarrow\) scale_y_log10()
I want to use a different color palette \(\rightarrow\) scale_fill_discrete() / scale_color_manual()
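A sketch combining a few of these layers on R’s built-in mtcars data (the variables and colors here are illustrative, not from the slides):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  scale_y_log10() +  # rescale the y-axis as log10
  scale_color_manual(values = c("4" = "#785EF0",
                                "6" = "#d90502",
                                "8" = "black"))  # custom palette
p
```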
ggplot2
Each graph is different and ggplot2 provides a zillion options to customize your graph to perfection.
Excellent cheatsheet on project website.
Cédric Scherer’s wonderful Graphic Design with ggplot2 tutorial
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
2. Manipulating data \(\checkmark\)
2.1 dplyr verbs \(\checkmark\)
2.2 group_by \(\checkmark\)
2.3 Joining data \(\checkmark\)
3.1 gg is for Grammar of Graphics \(\checkmark\)
The type of graph you need depends on whether:
Let’s start with plotting just one variable.
How many countries per year?
How many countries per region?
Histogram of hours per worker in 2022
Histogram of hours worked per adult in 2022
Density of hours per worker in 2022
Boxplots display the distribution of a variable, showing the median, the mean, and the interquartile range.
Box plot of hours per worker in 2022 by region
Use the data from the previous task to produce the following plots:
A histogram of hours per female worker (hours_worker_women) in 2022. Once you’ve created the histogram, within the appropriate geom_* set: binwidth to 5, boundary to 30, colour to “white” and fill to “#785EF0”. What does each of these options do? Optional: using the previous graph, facet it by region such that each region’s plot is a new row. (Hint: check the help for facet_grid.)
A boxplot of average hours worked hours_worked per year by region. Within the appropriate geom_* set: colour to “black” and fill to “#785EF0”. (Hint: you need to group by both region and year.)
A scatter plot of fertility rate (y-axis) with respect to infant mortality (x-axis) in 2015. Once you’ve created the scatter plot, within the appropriate geom_* set: size to 3, alpha to 0.5, colour to “#d90502”. Add labels (labs) to the plot so that it is cleaner.
1. Summarizing data \(\checkmark\)
1.1 Distributions \(\checkmark\)
1.2 Central tendency \(\checkmark\)
1.3 Spread \(\checkmark\)
1.4 Relationship between variables \(\checkmark\)
2. Manipulating data \(\checkmark\)
2.1 dplyr verbs \(\checkmark\)
2.2 group_by \(\checkmark\)
2.3 Joining data \(\checkmark\)
3.1 gg is for Grammar of Graphics \(\checkmark\)
3.2 Different types of graphs \(\checkmark\)
Lecture 2: Summarizing and Visualizing Data